Client/Server Architecture for Text-to-Speech Synthesis

CROSS-REFERENCE TO RELATED APPLICATIONS 

This application claims the benefit of U.S. Provisional Application No. 60/199,292, filed
4/24/2000, which is herein incorporated by reference.

FIELD OF THE INVENTION 
This invention relates generally to text-to-speech synthesis. More particularly, it relates to a 
client/server architecture for very high quality and efficient text-to-speech synthesis. 
BACKGROUND ART
Text-to-speech (TTS) synthesis systems are useful in a wide variety of applications such as 
automated information services, auto-attendants, avatars, computer-based instruction, and 
computer systems for the vision impaired. An ideal system converts a piece of text into
high-quality, natural-sounding speech in near real time. Producing high-quality speech requires a
large number of potential acoustic units and complex rules and exceptions for combining the 
units, i.e., large storage capability and high computational power. 

A prior art text-to-speech system 10 is shown schematically in FIG. 1. An original piece of
text is converted to speech by a number of processing modules. The input text specification
usually contains punctuation, abbreviations, acronyms, and non-word symbols. A text
normalization unit 12 converts the input text to a normalized text containing a sequence of
non-abbreviated words only. Most punctuation is useful in suggesting appropriate prosody,
and so the text normalization unit 12 extracts this punctuation for use as input to a prosody
generation unit 16. Other punctuation is extraneous and is filtered out completely.
Abbreviations and acronyms are converted to their equivalent word sequences, which may or
may not depend on context. The most complex task of the text normalization unit 12 is to
convert symbols to word sequences. For example, numbers, currency amounts, dates, times,
and email addresses are detected, classified, and then converted to text that depends on the
symbol's position in the sentence.
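By way of illustration only, the normalization step described above might be sketched in Python as follows. The abbreviation table, the 0 to 999 number range, and the function names are assumptions made for this sketch and are not taken from any particular prior art system; context-dependent expansion is not attempted.

```python
import re

# Hypothetical abbreviation table; a real system would be far larger and context-aware.
ABBREVIATIONS = {"Dr.": "doctor", "St.": "street", "No.": "number"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words if rest == 0 else words + " " + number_to_words(rest)


def normalize(text):
    """Return (normalized words, punctuation kept for the prosody stage).

    Each punctuation entry is (number of words preceding the mark, mark).
    """
    words, punctuation = [], []
    for token in text.split():
        if token in ABBREVIATIONS:
            # The trailing period belongs to the abbreviation, not the sentence.
            words.append(ABBREVIATIONS[token])
            continue
        # Split trailing punctuation off the token.
        match = re.fullmatch(r"(.*?)([.,;:!?]*)", token)
        core, marks = match.group(1), match.group(2)
        if core.isascii() and core.isdigit() and int(core) < 1000:
            words.extend(number_to_words(int(core)).split())
        elif core:
            # Anything else (including larger numbers) is passed through lowercased.
            words.append(core.lower())
        if marks:
            punctuation.append((len(words), marks))
    return words, punctuation


words, punctuation = normalize("Dr. Smith lives at 42 Main St. today.")
```

Here `words` holds the non-abbreviated word sequence and `punctuation` holds the marks that would be forwarded to the prosody generation unit.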

The normalized text is sent to a pronunciation unit 14 that first analyzes each word to
determine its simplest morphological representation. This is trivial in English, but in a
language in which words are strung together (e.g., German), words must be divided into base
words and prefixes and suffixes. Each resulting word is then converted to a phoneme
sequence, i.e., its pronunciation. The pronunciation may depend on a word's position in a
sentence or its context (i.e., the surrounding words). Three resources are used by the
pronunciation unit 14 to perform conversion: letter-to-sound rules; statistical representations
that convert letter sequences to most probable phoneme sequences based on language
statistics; and dictionaries, which are simple word/pronunciation pairs. Conversion can be
performed without statistical representations, but all three resources are preferably used.

Rules can distinguish between different pronunciations of the same word depending on its
context. Other rules are used to predict pronunciations of unseen letter combinations based
on human knowledge. Dictionaries contain exceptions that cannot be generated from rules or
statistical methods. The collection of rules, statistical models, and dictionary forms the
database needed for the pronunciation unit 14. This database is usually quite large in size,
particularly for high-quality text-to-speech conversion.
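A minimal sketch of how a dictionary and letter-to-sound rules might be combined is given below. The phoneme symbols, the dictionary entries, and the rule table are hypothetical and chosen only to make the example run; the statistical representations mentioned above are omitted.

```python
# Hypothetical exception dictionary: word -> phoneme list.
DICTIONARY = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

# Hypothetical letter-to-sound rules, longer letter sequences listed first
# so that "sh" is tried before "s" and "ee" before "e".
LETTER_RULES = [
    ("sh", ["SH"]), ("ee", ["IY"]),
    ("s", ["S"]), ("h", ["HH"]), ("e", ["EH"]), ("p", ["P"]), ("t", ["T"]),
]


def pronounce(word):
    """Look the word up in the dictionary, else fall back to the rules."""
    word = word.lower()
    if word in DICTIONARY:
        return list(DICTIONARY[word])
    phonemes, i = [], 0
    while i < len(word):
        for letters, sounds in LETTER_RULES:
            if word.startswith(letters, i):
                phonemes.extend(sounds)
                i += len(letters)
                break
        else:
            i += 1  # no rule matches this letter; skip it in this sketch
    return phonemes
```

With these tables, `pronounce("two")` is answered from the dictionary, while `pronounce("sheep")` is built from the rules.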

The resulting phonemes are sent to the prosody generation unit 16, along with punctuation
extracted from the text normalization unit 12. The prosody generation unit 16 produces the
timing and pitch information needed for speech synthesis from sentence structure,
punctuation, specific words, and surrounding sentences of the text. In the simplest case, pitch
begins at one level and decreases toward the end of a sentence. The pitch contour can also be
varied around this mean trajectory. Dates, times, and currencies are examples of parts of a
sentence that are identified as special pieces; the pitch of each is determined from a rule set or
statistical model that is crafted for that type of information. For example, the final number in
a number sequence is almost always at a lower pitch than the preceding numbers. The
rhythms, or phoneme durations, of a date and a phone number are typically different from
each other. Usually a rule set or statistical model determines the phoneme durations based on
the actual word, its part of the sentence, and the surrounding sentences. These rule sets or
statistical models form the database needed for this module; for the more natural sounding
synthesizers, this database is also quite large.
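The simplest case mentioned above, a pitch that begins at one level and decreases toward the end of the sentence, might be sketched as a linear interpolation. The function name and the default frequencies of 220 Hz and 180 Hz are illustrative assumptions only.

```python
def declining_pitch(num_phonemes, start_hz=220.0, end_hz=180.0):
    """Return one target pitch per phoneme, falling linearly across the sentence."""
    if num_phonemes <= 0:
        return []
    if num_phonemes == 1:
        return [start_hz]
    step = (end_hz - start_hz) / (num_phonemes - 1)
    return [start_hz + i * step for i in range(num_phonemes)]


contour = declining_pitch(5)
```

A rule set or statistical model of the kind described above would then vary the contour around this mean trajectory.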
The final unit, an acoustic signal synthesis unit 18, combines the pitch, duration and phoneme
information from the pronunciation unit 14 and the prosody generation unit 16 to produce the
actual acoustic signal. There are two dominant methods in state of the art speech
synthesizers. The first is formant synthesis, in which a human vocal tract is modeled and
phonemes are synthesized by producing the necessary formants. Formant synthesizers are
very small, but the acoustic quality is insufficient for most applications. The more widely
used high-quality synthesis technique is concatenative synthesis, in which a voice artist is
recorded to produce a database of sub-phonetic, phonetic, and larger multi-phonetic units.
Concatenative synthesis is a two-step process: deciding which sequence of units to use, and
concatenating them in such a way that duration and pitch are modified to obtain the desired
prosody. The quality of such a system is usually proportional to the size of the phonetic unit
database.
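The first of the two steps, deciding which sequence of units to use, might be sketched as follows. The sketch uses a greedy longest-match strategy purely as a stand-in; practical unit selection typically weighs how well candidate units fit the target and each other. The database contents and the function name are hypothetical.

```python
def select_units(phonemes, unit_db):
    """Cover the phoneme sequence with database units, longest match first."""
    max_len = max(len(key) for key in unit_db)
    selected, i = [], 0
    while i < len(phonemes):
        for length in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + length])
            if key in unit_db:
                selected.append(key)
                i += length
                break
        else:
            raise KeyError(f"no unit covers phoneme {phonemes[i]!r}")
    return selected


# Hypothetical database: phoneme tuple -> recorded samples (placeholders here).
UNIT_DB = {
    ("HH", "EH"): [0.1, 0.2], ("L", "OW"): [0.3, 0.4],
    ("HH",): [0.1], ("EH",): [0.2], ("L",): [0.3], ("OW",): [0.4],
}

units = select_units(["HH", "EH", "L", "OW"], UNIT_DB)
```

A larger database offers longer matching units, which is one reason quality tends to grow with database size.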



A high quality text-to-speech synthesis system thus requires large pronunciation, prosody, and
phonetic unit databases. While it is certainly possible to create and efficiently search such
large databases, it is much less feasible for a single user to own and maintain such databases.
One solution is to provide a text-to-speech system on a server machine that is available to a
number of client machines over a computer network. For example, a client provides the
system with a piece of text, and the server transmits the converted speech signal to the client.
Standard speech coders can be used to decrease the amount of data transmitted to the client.

One problem with such a system is that the quality of speech eventually produced at the client
depends on the amount of data transmitted from the server. Unless an unusually high
bandwidth connection is available between the server and the client, an unacceptably long
delay is required to receive the data needed to produce high quality sound at the
client. For typical client applications, the amount of data transmitted must be reduced so that
the communication traffic is at an acceptable level. This data reduction is necessarily
accompanied by approximations and loss of speech quality. The client/server connection is
therefore the limiting factor in determining speech quality, and the high-quality speech
synthesis at the server is not fully exploited.
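The delay at issue can be illustrated with rough arithmetic. The figures below (16 kHz, 16-bit uncompressed audio, an 8 kbit/s coder, and a 56 kbit/s link) are assumptions chosen only for illustration and do not come from any measured system.

```python
def transfer_seconds(speech_seconds, audio_bits_per_second, link_bits_per_second):
    """Time needed to send a given duration of encoded speech over a link."""
    return speech_seconds * audio_bits_per_second / link_bits_per_second


PCM_RATE = 16_000 * 16      # assumed uncompressed rate: 256,000 bit/s
CODED_RATE = 8_000          # assumed rate of a generic low-rate speech coder
LINK_RATE = 56_000          # assumed low-bandwidth client connection

uncompressed_delay = transfer_seconds(10, PCM_RATE, LINK_RATE)
coded_delay = transfer_seconds(10, CODED_RATE, LINK_RATE)
```

Under these assumptions, ten seconds of uncompressed speech takes roughly 46 seconds to transmit, whereas the coded stream takes under two seconds at the cost of the approximations described above.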

U.S. Patent No. 5,940,796, issued to Matsumoto, provides a speech synthesis client/server
system. A voice synthesizing server generates a voice waveform based on data sent from the
client, encodes the waveform, and sends it to the client. The client then receives the encoded
waveform, decodes it, and outputs it as voice. There are a number of problems with the
Matsumoto system. First, it uses signal synthesis methods such as formant synthesis, in
which a human vocal tract is modeled according to particular parameters. The acoustic
quality of formant synthesizers is insufficient for most applications. Second, the Matsumoto
system uses standard speech compression algorithms for compressing the generated
waveforms. While these algorithms do reduce the data rate, they still suffer the quality/speed
tradeoff mentioned above for standard speech coders. Generic speech coders are designed for
the transmission of unknown speech, resulting in adequate acoustic quality and graceful
degradation in the presence of transmission noise. The design criteria are somewhat different
for a text-to-speech system in which a pleasant sounding voice (i.e., higher than adequate
acoustic quality) is desired, the speech is known beforehand, and there is sufficient time to
retransmit data to correct for transmission errors. In addition, a client with sufficient CPU
resources is capable of implementing a more demanding decompression scheme. Given these
different criteria, a more optimal and higher compression methodology is possible.

There is still a need, therefore, for a client/server architecture for text-to-speech synthesis that 
outputs high-quality, natural-sounding speech at the client. 

SUMMARY 

Accordingly, the present invention provides a client/server system and method for
high-quality text-to-speech synthesis. The method is divided between the client and server such
that the server performs steps requiring large amounts of storage, while the client performs
the more computationally intensive steps. The data transmitted between client and server
consist of acoustic units that are highly compressed using an optimized compression method.

A text-to-speech synthesis method of the invention includes the steps of obtaining a
normalized text, selecting acoustic units corresponding to the text from a database that stores
a predetermined number of possible acoustic units, transmitting compressed acoustic units
from a server machine to a client machine, and, in the client, decompressing and
concatenating the units. The selected units are compressed before being transmitted to the
client using a compression method that depends on the predetermined number of possible
acoustic units. This results in minimal degradation between the original and received acoustic
units. For example, parameters of the compression method can be selected to minimize the
amount of data transmitted between the server and client while maintaining a desired quality
level. The acoustic units stored in the database are preferably compressed acoustic units.



Preferably, the client machine concatenates the acoustic units in dependence on prosody data
that the server generates and transmits to the client. The client can also store cached acoustic
units that it concatenates with the acoustic units received from the server. Preferably,
transmission of acoustic units and concatenation by the client occur simultaneously for
sequential acoustic units. The server can also normalize a standard text to obtain the
normalized text.



The invention also provides a text-to-speech synthesis method performed in a server machine.
The method includes obtaining a normalized text, selecting acoustic units corresponding to
the normalized text from a database that stores a predetermined number of possible acoustic
units, and transmitting compressed acoustic units to a client machine. The selected acoustic
units are compressed using a compression method that depends on the specific predetermined
number of possible acoustic units. Specifically, the compression method minimizes the
amount of data transmitted by the server while maintaining a desired quality level.
Preferably, the acoustic units in the database are compressed acoustic units. The server may
also generate prosody data corresponding to the text and transmit the prosody data to the
client. It may also normalize a standard text to obtain the normalized text.

Also provided is a text-to-speech synthesis method performed in a client machine. The client
receives compressed acoustic units corresponding to a normalized text from a server machine,
decompresses the units, and concatenates them, preferably in dependence on prosody data
also received from the server. Preferably, the method steps are performed simultaneously.
The units are selected from a predetermined number of possible units and compressed
according to a compression method that depends on the predetermined number of possible
units. For example, parameters of the compression method can be selected to minimize the
amount of data transmitted to the client machine. The client can also store at least one cached
acoustic unit that it concatenates with the received acoustic units. The client can also transmit
a standard text to be converted to speech to the server, or it can normalize a standard text and
send the normalized text to the server.




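Also provided is a text-to-speech synthesis system containing a server machine and a client
machine. The server machine selects acoustic units corresponding to a normalized text from a
database of predetermined acoustic units, generates prosody data corresponding to the
normalized text, and transmits the compressed units and the prosody data to the client
machine. The client machine concatenates the received acoustic units in dependence on the
prosody data. The compression method depends on the predetermined acoustic units.
Preferably, parameters of the compression method are selected to minimize the data
transmitted from the server machine. The database preferably stores compressed acoustic
units. The server machine preferably also normalizes a standard text to obtain the normalized
text. Preferably, the client machine stores cached acoustic units that are concatenated with
the received acoustic units.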

BRIEF DESCRIPTION OF THE FIGURES 

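FIG. 1 is a schematic diagram of a prior art text-to-speech synthesis system.
FIG. 2 is a schematic diagram of a client/server text-to-speech synthesis system of the present
invention.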

DETAILED DESCRIPTION 
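Although the following detailed description contains many specifics for the purposes of
illustration, anyone of ordinary skill in the art will appreciate that many variations and
alterations to the following details are within the scope of the invention. Accordingly, the
following preferred embodiment of the invention is set forth without any loss of generality to,
and without imposing limitations upon, the claimed invention.

The present invention provides a system and method for concatenative text-to-speech
synthesis in a client/server architecture. The invention takes advantage of the differing
capabilities of client and server by optimizing division of the method between them: the
large storage capacity of the server holds the databases, highly compressed acoustic
units are transmitted to the client, and the high processing speed of the client provides very
fast decompression and concatenation of the acoustic units as they are being received. This
architecture allows for a very large database of acoustic units, which is required to generate
high-quality and natural-sounding speech.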

A preferred embodiment of a system 20 of the present invention is shown in FIG. 2. The
system 20 contains a server machine 22 communicating with a client machine 24 according to
conventional protocols such as XML, HTTP, and TCP/IP. Although only one client machine
24 is shown, the system 20 typically contains many client machines communicating with at
least one server 22. For example, the server 22 and client 24 can be connected through the
Internet or a Local Area Network (LAN). General characteristics of the system components
are high processing speed and low memory of the client 24; high memory and low processing
speed of the server 22; and low bandwidth communication network connecting the client 24
and server 22. The server 22 actually has a high processing speed, but because each client has
access to only a small portion of this computational power, the server appears relatively weak
to each client. It will be apparent to one of average skill in the art that the terms "high" and
"low" are defined relative to state of the art hardware and are not limited to particular speeds
or capabilities. The present invention exploits these characterizations by minimizing data
transfer between client and server, maximizing calculation and minimizing data storage on the
client, and minimizing calculation and maximizing data storage on the server.

Methods of the invention are executed by processors 26 and 28 of the server 22 and client 24,
respectively, under the direction of computer program code 30 and 32 stored within the
respective machines. Using techniques well known in the computer arts, such code is
tangibly embodied within a computer program storage device accessible by the processors 26
and 28, e.g., within system memory or on a computer readable storage medium such as a hard
disk or CD-ROM. The methods may be implemented by any means known in the art. For
example, any number of computer programming languages, such as Java, C++, or LISP may
be used. Furthermore, various programming approaches such as procedural or object oriented
may be employed.

The invention performs concatenative synthesis generally according to the method outlined in
FIG. 1. The first three steps, text normalization 12, pronunciation analysis 14, and prosody
generation 16, and the first step of acoustic signal synthesis 18, acoustic unit selection, are
performed by the server 22. These steps are relatively computationally inexpensive but
require large databases in order to produce high-quality, natural-sounding speech. A
pronunciation database 34 stores at least one of letter-to-sound rules, statistical
representations, and dictionaries. The set of acoustic units stored in an acoustic unit database
38 is known; i.e., the database 38 stores a predetermined number of possible acoustic units.
The quality of synthesized speech is typically proportional to the size of the acoustic unit
database 38. These steps and their associated databases are known in the art and will not be
described herein.
The compression method of the present invention does not address a generic speech
compression problem, but is rather directed toward the predetermined set of acoustic units
stored in the database 38. In contrast, standard speech coders must compress a continuous
stream of speech, which is not divided into fundamental reusable units, and must allow for all
speech possibilities, rather than a predetermined set of acoustic units. They therefore require
significant approximations and can be quite lossy, reducing the resulting speech quality
significantly. A preferred compression algorithm is discussed below.

The client computer 24 receives the compressed acoustic units and prosody data from the
server 22 using a client transfer module 42, e.g., an Internet connection. The incoming data is
stored in a relatively small buffer storage 44 before being processed by the client processor 28
according to the program code instructions 32. Preferably, particular compressed acoustic
units and prosody data are processed by the client 24 as soon as they are received, rather than
after all of the data have been received. That is, the acoustic units are streamed to the client
24, and speech is output as new units are received. Thus for sequential acoustic units,
reception, processing, and output occur simultaneously. Streaming is an efficient method for
processing large amounts of data, because it does not require an entire file to be transmitted to
and stored in the client 24 before processing begins. The data are not retained in memory
after being processed and output.
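The streaming behavior described above, in which each unit is processed and output as soon as it arrives, might be sketched with a Python generator. The `receive`, `play`, and `stream_speech` names and the trivial byte-to-sample "decompression" are hypothetical placeholders for the transfer module, speech output, and real decoder.

```python
from typing import Callable, Iterable, Iterator, List


def stream_speech(incoming: Iterable[bytes],
                  decompress: Callable[[bytes], List[int]]) -> Iterator[List[int]]:
    """Yield decoded samples for each unit as it arrives; nothing is retained."""
    for compressed_unit in incoming:
        yield decompress(compressed_unit)


log = []


def receive():
    """Stand-in for the client transfer module: units arrive one at a time."""
    for unit in (b"\x01\x02", b"\x03"):
        log.append(("received", unit))
        yield unit


def play(samples):
    """Stand-in for the speech output unit."""
    log.append(("played", samples))


for samples in stream_speech(receive(), decompress=lambda unit: list(unit)):
    play(samples)
```

After the loop, `log` shows reception and playback interleaved: the first unit is played before the second unit is received.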

The server transfer module 40 and client transfer module 42 can be any suitable 
complementary mechanisms for transmitting data. For example, they can be network 
connections to an intranet or to the Internet. They can also be wireless devices for 
transmitting electromagnetic signals. 

Processing by the client machine includes retrieving the compressed acoustic units from the 
buffer 44, decompressing them, and then pitch shifting, compressing, or elongating the 
decompressed units in accordance with the received prosody data. These steps are relatively 
computationally intensive and take advantage of the high processing speed of the client 
processor 28. The units are then concatenated and sent to a speech output unit 46 (e.g., a 
speaker) that contains a digital-to-analog converter and outputs speech corresponding to the 
concatenated acoustic units. 
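As a rough illustration of the client-side steps, the sketch below elongates or compresses each decompressed unit to a target length and then concatenates the results. It uses naive nearest-neighbour resampling as a stand-in, which also alters pitch; an actual client would use a prosody modification technique that controls duration and pitch separately. The function names and the representation of prosody data as target sample counts are assumptions.

```python
def stretch(samples, target_len):
    """Resample a unit to target_len samples by nearest-neighbour selection."""
    if target_len <= 0 or not samples:
        return []
    n = len(samples)
    return [samples[min(n - 1, i * n // target_len)] for i in range(target_len)]


def concatenate(units, target_lengths):
    """Adjust each unit to its target length, then join the units in order."""
    output = []
    for samples, target_len in zip(units, target_lengths):
        output.extend(stretch(samples, target_len))
    return output


# Two decompressed units: the first is shortened, the second elongated.
speech = concatenate([[1, 2, 3, 4], [1, 2]], [2, 4])
```

The resulting sample list is what would be handed to the digital-to-analog converter of the speech output unit.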
Preferably, a small number of uncompressed acoustic units are cached in a cache memory 48
of the client 24. The processor 28 can access the cache memory 48 much more quickly than it
can access the client's main memory. Frequently used acoustic units are cached on the client
24 so that the server 22 does not need to transmit such units repeatedly. Clearly, there is a
tradeoff between client storage and transmission bandwidth: the more units
that are stored in the cache 48, the fewer the units that must be transmitted. The number
of cached units can be chosen according to the client's
capabilities. The server 22 has information about which units are cached on which client
machine. When the server 22 selects a unit that is cached on the relevant client machine, it
transmits only an identifier of the unit to the client, rather than the compressed unit itself.
During decompression, the client determines that the transmitted data includes an identifier of
a cached unit and quickly retrieves the unit from the cache 48.
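One way to picture the identifier exchange is the sketch below, in which the server keeps a record of each client's cached units and sends either an identifier or a compressed unit. The message tuples, class names, and the trivial `decompress` function are hypothetical and are not a wire format defined by the invention.

```python
class Server:
    def __init__(self, unit_db, client_caches):
        self.unit_db = unit_db              # unit id -> compressed bytes
        self.client_caches = client_caches  # client id -> set of cached unit ids

    def messages_for(self, client_id, unit_ids):
        """Yield an identifier for cached units, the compressed unit otherwise."""
        cached = self.client_caches.get(client_id, set())
        for unit_id in unit_ids:
            if unit_id in cached:
                yield ("id", unit_id)
            else:
                yield ("unit", self.unit_db[unit_id])


class Client:
    def __init__(self, cache, decompress):
        self.cache = cache                  # unit id -> uncompressed samples
        self.decompress = decompress

    def decode(self, message):
        """Return samples from the cache or by decompressing the payload."""
        kind, payload = message
        if kind == "id":
            return self.cache[payload]
        return self.decompress(payload)


server = Server({1: b"\x05\x06", 2: b"\x07"}, {"client-a": {2}})
client = Client({2: [70]}, decompress=list)
messages = list(server.messages_for("client-a", [1, 2, 1]))
decoded = [client.decode(m) for m in messages]
```

In this run only unit 2 travels as an identifier; unit 1 is sent in compressed form both times because it is not in the client's cache.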

The present invention can be implemented with any compression method that optimizes
compression based on the fact that the complete set of possible acoustic units is known. The
following example of a suitable compression method is intended to illustrate one possible
compression method, but does not limit the scope of the present invention. In a currently
preferred embodiment of the compression method, each acoustic unit is divided into
sequences of chunks of equal duration (e.g., 10 milliseconds). Each chunk is described by a
set of parameters describing the frequency composition of the chunk according to a known
model. For example, the parameters can include line spectral pairs of a Linear Predictive
Coding (LPC) model. One of the parameters indicates the number of parameters used, e.g.,
the number of line spectral pairs used to describe a single chunk. The higher the number of
parameters used, the more accurate will be the decompressed unit. The chunk is regenerated
using the model and the parameter set, and a residual, the difference between the original and
regenerated chunk, is obtained. The residual is modeled as, for example, a set of carefully
placed impulses. The set of LPC parameters describing the full database can, in addition, be
quantized into a small set of parameter vectors, or a codebook. The same quantization can be
applied on the residual vectors to reduce the description of each frame to two indices: that
of the LPC vector and that of the residual vector.
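The final reduction of each frame to two indices might be sketched as a nearest-codeword lookup. The codebooks below are tiny hypothetical examples; how the codebooks are built from the full database and how the LPC and residual vectors are computed are outside the scope of this sketch.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def encode_frame(lpc_vector, residual_vector, lpc_codebook, residual_codebook):
    """Describe a frame by two indices: LPC codeword and residual codeword."""
    return (nearest_index(lpc_vector, lpc_codebook),
            nearest_index(residual_vector, residual_codebook))


def decode_frame(indices, lpc_codebook, residual_codebook):
    """Look the two codewords back up on the receiving side."""
    lpc_index, residual_index = indices
    return lpc_codebook[lpc_index], residual_codebook[residual_index]


# Hypothetical codebooks shared by server and client.
LPC_CODEBOOK = [[0.1, 0.5], [0.3, 0.9]]
RESIDUAL_CODEBOOK = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

indices = encode_frame([0.28, 0.85], [0.1, 0.0, 0.8],
                       LPC_CODEBOOK, RESIDUAL_CODEBOOK)
frame = decode_frame(indices, LPC_CODEBOOK, RESIDUAL_CODEBOOK)
```

Only the pair in `indices` would need to be transmitted for this frame, since both sides hold the same codebooks.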

Given this framework for the compression method, the method is optimized to select the
number of parameters. Using a directed optimized search, the number of parameters of the
frequency model and the number of impulse models for the residual are selected. The search
is directed by an acoustic metric that measures quality. This metric is a combination of
indirect measures such as a least mean squared difference between the encoded speech and
the original, which can be used in, e.g., a gradient descent search, as well as perceptual
measures such as a group of people grading the perceived quality, which post-qualifies a
parameter set. The frequency model numbers and residual are then coded through an
optimally selected codebook that uses the least possible number of code words to describe the
known database. The indices to code words are the compressed acoustic units that are
transmitted from the server 22 to the client 24.
In a typical application, the client machine 24 has a standard text that it would like converted 
into speech. As used herein, a standard text is one that has not yet been normalized and 
therefore may contain punctuation, abbreviations, acronyms, numbers in various formats
(currency amounts, dates, times), email addresses, and other symbols. The client machine 24 
transmits its standard text to the server 22, which begins the process by normalizing the 
standard text as described above. Transmitting the standard text from the client to the server 
does not affect the transmission of compressed acoustic units from server to client, because 
text transmission requires relatively little bandwidth. Alternatively, the server 22 may receive 
the text and client identifier from a different source, or it may generate the text itself. 

Although the system of the invention has been described with respect to a standard 
client/server computer architecture, it will be apparent to one of average skill in the art that 
many variations to the architecture are within the scope of the present invention. For
example, the client machine can be a dedicated device containing only a processor, small
memory, and voice output unit, such as a Personal Digital Assistant (PDA), cellular 
telephone, information kiosk, or speech playback device. The client and server can also 
communicate through wireless means. The steps performed by the server can be performed 
on multiple servers in communication with each other. It is to be understood that the steps 
described above are highly simplified versions of the actual processing performed by the 
client and server machines, and that methods containing additional steps or rearrangement of 
the steps described are within the scope of the present invention. 

It will be clear to one skilled in the art that the above embodiment may be altered in many 
ways without departing from the scope of the invention. Accordingly, the scope of the 
invention should be determined by the following claims and their legal equivalents. 